“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 datasets, the techniques learned herein will be broadly applicable.
This lesson is the first in a 6-part series. By the end of this series, students should be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.
This week will be your crash-course on Jupyter notebooks and R to refresh on packages and principles that will be relevant throughout our course. In our lectures and your assignments we will be working with some uncurated data to simulate the full experience of working with data from start to finish. It’s important that we are all familiar with, and understand, the majority of the tidy data methods that we’ll be using in class so that we can focus on the new material as it appears. We’ll use some standard packages and practices to finesse our data before visualizing it, so let’s R-efresh ourselves.
At the end of this lecture we will have covered the following topics:
- tidyverse package

Throughout these notes, the following formatting conventions are used:

- grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
- italics - an important term or concept, or an individual file or folder
- bold - a heading or a term that is being defined
- blue text - a named or unnamed hyperlink
- ... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
- Blue box: a key concept that is being introduced
- Yellow box: a risk or caution
- Green box: recommended reads and resources to learn R
- Red box: a comprehension question, which may or may not involve a coding cell. You will usually find these at the end of a section.
Each week, new lesson files will appear within your JupyterHub
folders. We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto
JupyterHub. You will need to use your UTORid credentials to complete
the login process. From there you will find each week’s lecture files in
the directory /2024-03-Adv_Graphics_R/Lecture_XX. You will
find a partially coded skeleton.rmd file as well as all of
the data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook
(.Rmd) and data files from the RStudio server to your
personal computer if you would like to run independently of the Toronto
tools.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
Today’s datasets will focus on epidemiological data from the Ontario provincial government found here and here.
This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 cases throughout different public health units in the province. It is in a comma-separated format and covers data collected from 2020-03-24 through 2023-01-25.
This dataset was obtained from the Ontario provincial website and holds data regarding SARS-CoV-2 throughout 5 Ontario health regions. It is in a comma-separated format and has been growing since initial tracking started on 2020-04-02, with records through 2023-02-23. This dataset tracks only 7 variables, specifically the daily totals of hospitalized COVID-19 patients in both the ICU and general care.
tidyverse which has a number of packages including
dplyr, tidyr, stringr,
forcats and ggplot2
viridis helps to create color-blind palettes for our
data visualizations
lubridate and zoo are helper packages used
for working with date formats in R
Let’s run our first code cell!
# Packages to help tidy our data
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()   masks stats::filter()
✖ purrr::is_empty() masks git2r::is_empty()
✖ dplyr::lag()      masks stats::lag()
✖ dplyr::pull()     masks git2r::pull()
✖ purrr::when()     masks git2r::when()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Packages for the graphical analysis section
library(viridis)
Loading required package: viridisLite
# packages used for working with/formatting dates in R
library(lubridate)
library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Your work with the R Markdown notebook on the University of Toronto datatools hub will all be contained within a new browser tab with the address bar showing something similar to
https://r.datatools.utoronto.ca/user/calvin.mok@utoronto.ca/rstudio/
All of this is running remotely on a University of Toronto server rather than your own machine.
You’ll see a directory structure from your home folder:
i.e. /home/rstudio/2024-03-Adv_Graphics_R/, with a folder
Lecture_01_R_Introduction within. Clicking on that, you’ll
find Lecture_01.R-efresher.skeleton.Rmd which is the
notebook we will use for today’s code-along lecture.
We’ve implemented the class this way to reduce the burden of having to install various programs. While installation can be a little tricky, it’s really not that bad. For this course, however, you don’t need to go through all of that just to improve on your data visualization skills.
R Markdown notebooks also give us the option of inserting “markdown” text, much like what you’re reading at this very moment. So we can intersperse ideas and information between our learning code blocks.
There is, however an appendix section at the end of this lecture detailing how to install the R-kernel itself and the integrated development environment (IDE) called RStudio.
So… what is in these packages? A package can be a collection of:

- functions
- data objects
- compiled code
- functions that override base functions in R
Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).
In this course we will frequently rely on a package called
tidyverse which is also composed of a series of other
packages we can use to reformat our data like readr,
dplyr, tidyr and stringr.
Behind the scenes of each markdown notebook the R kernel is running. As we move from code cell to new code cell, all of the variables or objects we have created are stored within memory. We can refer to these as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cell blocks!
There are some options in the “Code” menu that can alleviate these problems such as “Run Region > Run All Chunks Above”. If you think you’ve made a big error by overwriting a key object, you can use that option to “re-initialize” all of your previous code!
The run order of your code is also visible at the side of each code
cell as [x]. When a code cell is still actively running it
will be denoted as [*] since a number cannot be assigned to
it. You’ll also notice your kernel (top right of the menu bar) has a
small circle that will be dark while running, and clear while idle.
Remember these friendly keys/shortcuts:

- Arrow keys to navigate up and down (and within a cell)
- Ctrl+Shift+Enter to run a cell (both code and markdown)
- Alt+Ctrl+Enter to run the next cell
- Ctrl+Shift+C to quickly comment and uncomment single or multiple lines of code
- Tab can be used while coding to autocomplete variable, function and file names, and even look at a list of possible parameters for functions
- Ctrl+Alt+I to insert a new coding cell

Depending on your needs, you may find yourself doing the following:
Markdown allows you to alternate between “markdown” notes and “code” that can be run or re-run on the fly.
Each data run and its results can be saved individually as a new notebook to compare data and small changes to analyses!
Markdown is a markup language that lets you write HTML and JavaScript code in combination with other languages. This allows you to make html, pdf, and text documents that are combinations of text and code, enhancing reproducibility, a key aspect in scientific work. Having everything in a single place also boosts productivity during results interpretation - no need to go back and forth between tabs, pages, and documents. They can all be integrated in a single document, allowing for a more fluid narrative of the story that you are communicating to your audience (fewer distractions for you!). For example, the lines of code below and the text you are reading right now were created in R’s Markdown language. (Do not worry about the R code just yet. We will get there sooner than you think.)
As mentioned, markdown also allows you to write in LaTeX, a document preparation system to write mathematical notation. All it takes is to wrap LaTeX code between single dollar signs ($) for inline notation or double dollar signs ($$), one pair at the beginning of the equation and one at the end. For example, the equation Y_i = beta_0 + beta_1 x_i + epsilon_i, i=1, …, N can be written in LaTeX code as Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots, N. Now, if we use $$ before and after the LaTeX code, this is what we get:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots,N \]
See? Just like that! Here is an example of a table made in Markdown, showing some of the most popular R libraries for data science:
| Library | Use |
|---|---|
| tidyverse | Simplified tabular-data processing functions |
| ggplot2 | Data visualization package typically included in the tidyverse |
| shiny | Used to create interactive R-based web pages and interfaces |
| car | Popular statistical analysis with Type II and III ANOVA tables |
These are just a few examples of what you can do with Jupyter and Markdown. To find out more on how to get the best of Markdown, head on over to the [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/).
Once you are finished writing your code and interpreting those results in a markdown notebook, you can render the notebook into pdf, html, and many other formats. There are several ways to achieve this. The easiest option is to go to File > Knit Document. Afterwards there should be an option to view in browser at which point you can save as an HTML or print it to PDF.
Let’s discuss some important behaviours before we begin coding:

- Code annotation (commenting)
- Variable naming conventions
- Best practices
Comments are marked with the # symbol.

Why bother?
Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?
(Meme image credit: https://www.testbytes.net/blog/programming-memes/)
You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.
How do I start?
It is, in general, part of best coding practices to keep things tidy and organized.
A hash-tag # will comment your text. Inside a code
cell in a Jupyter Notebook or anywhere in an R script, all text
after a hashtag will be ignored by R and by many other
programming languages. It’s very useful to add comments about changes in
your code, as well as detailed explanations about your scripts.
Put a description of what you are doing near your code at every
process, decision point, or non-default argument in a function. For
example, why you selected k=6 for an analysis, or the
Spearman over Pearson option for your correlation matrix, or quantile
over median normalization, or why you made the decision to filter out
certain samples.
Break your code into sections to make it readable. Scripts are just a series of steps and major steps should be titled/outlined with your reasoning - much like when presenting your research.
Give your objects informative object names that are not the same as function names.
Comments may/should appear in three places:
# Example commenting section
# At the beginning of the script, describing the purpose of your script and what you are trying to solve
bedmasAnswer <- 5 + 4 * 6 - 0 # In-line: describing a part of your code whose purpose is not obvious
#---------- Section dividers helps organize code structure ----------#
## Feel free to add extra hash tags to visually separate or emphasize comments
Maintaining well-documented code is also good for mental health!
Stylistically, you have options such as camelCase, snake_case, and period.separated names.
The most important aspects of naming conventions are being concise
and consistent! Throughout this course you’ll see a hybrid system that
uses the underscore to separate words but a
period right before denoting the object type ie
this_data.object.
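For instance, here are a few objects named in that hybrid style. The names and values below are our own, purely for illustration:

```r
# snake_case separates words; the suffix after the period hints at the
# object type (these names and values are illustrative only)
daily_cases.df   <- data.frame(day = 1:3, cases = c(10, 12, 9))
region_names.vec <- c("East", "West", "North")
case_totals.list <- list(east = 31, west = 27)
```

Whatever convention you pick, the suffix makes it obvious at a glance whether you are handling a data frame, a vector, or a list.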
Start each script with a description of what it does.
Then load all required packages.
Consider what working directory you are in when sourcing a script.
Use comments to mark off sections of code.
Put function definitions at the top of your file, or in a separate file if there are many.
Name and style code consistently.
Break code into small, discrete pieces.
Factor out common operations rather than repeating them.
Keep all of the source files for a project in one directory and use relative paths to access them.
Keep track of the memory used by your program.
Always start with a clean environment instead of saving the workspace.
Keep track of session information in your project folder.
Have someone else review your code.
Use version control.
For more information on best coding practices, please visit swcarpentry
We all run into problems. We’ll see a lot of mistakes happen in class too! That’s OK if we can learn from our errors and quickly (or eventually) recover.
Usually when R generates an error it will produce some information about what has happened. This usually includes an error message detailing the kind of error it encountered or an error message generated by the function. It can also include a line where the error was encountered, or the name of the last function that was called before the error was encountered.
file does not exist: Use getwd() to check
where you are working, type list.files() or use the
Files pane to check that your file exists there, and
setwd() to change your directory if necessary. Preferably,
work inside an R project with all project-related files in that
same folder. Your working directory will be set automatically when you
open the project (this can be done by using
File -> New Project and following the prompts).
typos: R is case sensitive so always check that you’ve spelled everything right. Get used to using the tab-autocompletion feature when possible. This can reduce typos and increase your overall programming speed.
open quotes, parentheses, brackets:
data type: Use commands like typeof() and
class() to check what type of data you have. Use
str() to peek at your data structures if you’re making
assumptions about it.
unexpected answers: To access the help menu,
type help("function"), ?function (using the
name of the function that you want to check), or
help(package = "package_name").
function not found: Make sure the package name is
properly spelled, installed, AND loaded. Libraries can be loaded to the
environment using the function library("package_name"). If
you only need one function from a package, or need to specify to what
package a function belongs because there are functions with the same
name that belong to different packages, you can use a double colon,
i.e. package_name::function_name.
the R bomb!!: The session aborted can
happen for a variety of reasons, like not having enough computational
power to perform a task or because of a system-wide failure. You will
need to rerun your previous cells!

cheatsheets: Meet your new best friends: cheatsheets!
99% of the time, someone has already asked your question
Google, Stack overflow, R Bloggers, SEQanswers, Quora,
ResearchGate, RSeek, twitter, even reddit
Including the program, version, error, package and function helps, so be specific. Sometimes it is useful to include your operating system and version (Windows 10, Ubuntu 18, Mac OS 10, etc.).
You may run into assignment questions where the tools I’ve provided in lecture are not enough to reproduce the example output exactly as provided. If you wish to go that extra mile you may need to look for answers elsewhere by consulting references from the class or searching for it yourself. The truth is out there!
Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.
Last but not least, to make life easier: Under the Help
pane, there is a cheatsheet of Jupyter notebook keyboard
shortcuts or a browser list here.
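As a quick illustration of the “function not found” advice above, here is a minimal sketch (using only base packages, so it runs anywhere) of calling a function through its package namespace with the double colon:

```r
# The double colon calls a function from a specific package without
# attaching the whole package -- and disambiguates name clashes, e.g.
# dplyr::filter() masking stats::filter() after library(tidyverse).
smoothed <- stats::filter(1:10, rep(1/3, 3))  # base R's moving-average filter
length(smoothed)  # still 10 elements; the ends are padded with NA
```

If you meant dplyr’s filter() instead, dplyr::filter() leaves no ambiguity, regardless of which packages are loaded.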
There are many tips and tricks to remember about R but here we’ll quickly recall some foundational knowledge that could be relevant in later lectures.
If we want to hold on to a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!
-> and ->> Rightward
assignment: we won’t really be using this in our course.
<- and <<- Leftward
assignment: assignment used by most ‘authentic’ R programmers
but really just a historical keyboard throwback.
= Leftward assignment: commonly used
token for assignment in many other programming languages but holds dual
meaning!
In R, the assignment of a variable does not produce any standard output.
R processes at each new line unless you use a semicolon (;) to
separate commands. This applies to assignment as well. One exception
being when your function calls are spaced across lines and contained
within the ().
R calculates the right side of the assignment first; the result is then applied to the left.
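A minimal sketch of the three assignment tokens and the right-side-first evaluation order (values chosen for illustration):

```r
x <- 5    # leftward assignment: the conventional R style
10 -> y   # rightward assignment: rarely seen in practice
z = x + y # '=' assigns at the top level, but doubles as the token for
          # naming function arguments -- hence its "dual meaning"

# The right-hand side is evaluated first, then bound to the name:
total <- x * 2 + y  # 5 * 2 + 10 is computed, then stored in 'total'
total               # the assignment printed nothing; this line does
```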
Data types are used to classify the basic spectrum of values that are used in R. Here’s a table describing some of the common data types we’ll encounter.
| Data type | Description | Example |
|---|---|---|
| character | Can be single or multiple characters (strings) of letters and symbols. Assigned using single ' or double " quotes | a#c&E |
| integer | Whole number values, either positive or negative | 1 |
| double | Any number that is not an integer | 7.5 |
| logical | Also known as a boolean, representing the state of a conditional (question) | TRUE or FALSE |
| NA | Represents the value of “Not Available”, usually seen when imported data has missing values | NA |
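We can confirm each row of the table with typeof(). One subtlety worth noting: R marks an integer literal with an L suffix, and a bare 1 is actually a double.

```r
typeof("a#c&E")  # "character"
typeof(1L)       # "integer"  (a bare 1 would be a double!)
typeof(7.5)      # "double"
typeof(TRUE)     # "logical"
is.na(NA)        # TRUE: NA marks a missing value
```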
The job of data structures is to “host” the different data types. There are five basic types of data structures that we’ll use in R:
| Data structure | Dimensions | Restrictions |
|---|---|---|
| vector | 1D | Holds a single data type |
| matrix | 2D | Holds a single data type |
| array | nD | Holds a single data type |
| data frame | 2D | Holds multiple data types with some restrictions |
| list | 1D (technically) | Holds multiple data types AND structures |
Sometimes it is helpful to imagine Data Structures as real-world objects to understand how they are shaped and related to each other.
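One-line constructors for each structure in the table, with values chosen only for illustration:

```r
vec <- c(1, 2, 3)                     # vector: 1D, a single data type
mat <- matrix(1:6, nrow = 2)          # matrix: 2D, a single data type
arr <- array(1:24, dim = c(2, 3, 4))  # array: n-dimensional, one type
df  <- data.frame(id = 1:3,           # data frame: 2D, mixed types
                  ok = c(TRUE, FALSE, TRUE))
lst <- list(vec, df, "anything")      # list: mixed types AND structures
```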
Also known as atomic vectors, each element within a vector must be of the same data type: logical, integer, double, character, complex, or raw.
For each vector there are two key properties that can be queried
with typeof() and length().
There is a numerical order to a vector, much like a queue AND you
can access each element (piece of data) individually or in groups.
Elements are ordered from 1 to
length(your_vector) and can be accessed with an indexing
operator []
Elements of a vector may be named, to facilitate subsetting by character vectors.
Elements of a vector may be subset by a logical vector.
# Build a character vector
char.vector <- c("Canada", ..., "Great Britain")
char.vector
# subset by a single value
char.vector[...]
# subset by multiple values
char.vector[...]
# subset by removing values (cannot be mixed with positive values)
char.vector[c(-1, ...)]
# subset with repeating multiple values
char.vector[c(1, 2, 3, ...)]
# Build a named character vector by including variable names
character.vector <- c(a = ..., b = "United States", c = "Great Britain")
character.vector
# subset by element name
character.vector[c("a", ...)]
# subset by an explicit vector of logicals
character.vector[c(...)]
# Or subset by an implicit vector of logicals
character.vector[character.vector ...]
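For reference, one possible completed version of the named-vector cells above (the placeholder values are our own):

```r
# Build a named character vector by including element names
character.vector <- c(a = "Canada", b = "United States", c = "Great Britain")

character.vector[c("a", "c")]           # subset by element name
character.vector[c(TRUE, FALSE, TRUE)]  # explicit vector of logicals
character.vector[character.vector != "United States"]  # implicit logicals
```

All three subsets return the same two elements, names included.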
R will implicitly force (coerce) your vector to be of one data type. In this case, the type that is most inclusive is a character vector. When we explicitly coerce a change from one data type to the next, it is known as casting. You can cast between certain data types and also object types.
Type-casting examples: as.logical(),
as.integer(), as.double(),
as.numeric(), as.character(), and
as.factor()
Structure casting examples: as.data.frame(),
as.list(), and as.matrix()
Importantly, when coercing, the R kernel converts from more specific to general types usually in this order:
logical \(\rightarrow\) integer \(\rightarrow\) numeric \(\rightarrow\) complex \(\rightarrow\) character \(\rightarrow\) list.
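A few quick checks of that coercion order; the results noted in the comments follow R’s promotion rules:

```r
typeof(c(TRUE, 1L))      # "integer": logical is promoted to integer
typeof(c(1L, 2.5))       # "double"
typeof(c(2.5, "three"))  # "character": the most general type wins
suppressWarnings(as.numeric(c("1", "two")))  # 1 NA: "two" cannot be cast
```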
# Make a logical vector and display its structure
logical.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
str(logical.vector)
logi [1:5] TRUE FALSE TRUE FALSE FALSE
# Make a numeric vector and display its structure
numeric.vector <- c(-1:10)
str(numeric.vector)
int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
# Make a mixed vector and display its structure. Take a note of its typing afterwards
mixed.vector <- c(FALSE, TRUE, 1, 2, "three", 4, 5, ...)
str(mixed.vector)
# Attempt to coerce our vectors
# logical to numeric
as.numeric(...)
# numeric to logical
as.logical(...)
# numeric to character
as.character(...)
# mixed to a numeric. Note what happens when elements cannot be converted
as.numeric(...)
Now that we have had the opportunity to create a few different vector objects, let’s talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this, the same function can behave differently depending on the class of the object it receives.
Some R package developers have created their own object classes. For
example, many of the functions in the tidyverse generate
tibble objects. They behave in most ways like a
data.frame but have a more refined print structure, making
it easier to see information such as column types when viewing them
quickly. In general, from a trouble-shooting standpoint, it is good to
be aware that your data may need to be formatted to fit a
certain class of object when using different packages.
After we are done tidying most of our datasets, they will be in tibble objects, but all of the basic data frame functions apply to these as well.
While matrices are 2-dimensional structures limited to a single specific type of data within each instance, data frames treat each column of the structure like a vector. The data frame, however, can have multiple data types mixed across each different column. Data frame rules to remember are:
Data frames allow us to generate tables of mixed information much like an Excel spreadsheet.
# Generate a data frame with different variable/column types
mixed.df <- data.frame(... = character.vector,
... = numeric.vector[2:4],
... = logical.vector[1:3])
# View the data frame
mixed.df
# Check the structure of the data frame
str(mixed.df)
nrow(data_frame) retrieves the number of rows in a
data frame.
ncol(data_frame) retrieves the number of columns in
a data frame.
data_frame$column_name accesses a specific column by
its name.
data_frame[x,y] accesses a specific element located
at row x, column y.
rownames(data_frame) retrieves or assigns row names
to your data frame
colnames(data_frame) retrieves or assigns columns
names to your data frame
There are many more ways to access and manipulate data frames that we’ll explore further down the road. Let’s review some basic data frame code.
# query the dimensions of the data frame
dim(mixed.df)
nrow(mixed.df)
ncol(mixed.df)
# retrieve row and column names
rownames(mixed.df)
colnames(mixed.df)
# print the mixed data frame
mixed.df
# Access portions of the data frame
# a single column
str(mixed.df$...)
# a single element
mixed.df[2, 3]
mixed.df[3, ...]
# multiple rows
mixed.df[c(1,3), ]
mixed.df[-2, ]
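For reference, a completed version of the data frame cells above, with illustrative values of our own:

```r
# Generate a data frame with different variable/column types
mixed.df <- data.frame(country = c("Canada", "United States", "Great Britain"),
                       values  = c(0, 1, 2),
                       logic   = c(TRUE, FALSE, TRUE))

dim(mixed.df)        # 3 rows, 3 columns
mixed.df$country     # a single column, accessed by name
mixed.df[2, 3]       # a single element: row 2, column 3 (FALSE)
mixed.df[c(1, 3), ]  # rows 1 and 3, all columns
mixed.df[-2, ]       # everything except row 2 (same result here)
```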
Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types to pass around your scripts, and functions, or when receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!
If you forget the contents of your list, use the str()
function to check out its structure. str() will tell you
the number of items in your list and their data types.
# Make a named list of various items
mixed.list <- list(countries = character.vector, values = numeric.vector, mixed.data = ...)
# Look at some information about our list
str(mixed.list)
# What are the names of the elements in mixed.list
names(mixed.list)
# Lists can often be unnamed
unnamed.list <- list(character.vector, numeric.vector, ...)
# Look at some information about our unnamed list
str(unnamed.list)
names(unnamed.list)
Accessing lists is much like opening up a box of boxes of chocolates. You never know what you’re gonna get when you forget the structure!
You can access elements with a mixture of number and naming
annotations much like data frames. Also [[x]] is meant to
access the xth “element” of the list. Note that unnamed lists
cannot be accessed with naming annotations.

- [x] returns a list object with your element(s) of choice in the list.
- [[x]] returns a “single” element only, but that element could be a vector, data frame, list, etc.

# Subset our list with []
mixed.list[c(...)]
mixed.list[...]
# Pull out a single element
mixed.list[[2]]
mixed.list[["countries"]]
# Give a vector as input to [[]]
mixed.list[[c(1,3)]]
# vs equivalent
mixed.list[[1]][3]
# Access a single element from a data frame nested in a list
mixed.list[[c(...)]]
# vs equivalent
mixed.list[[3]][...]
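To make the [ ] versus [[ ]] distinction concrete, here is a small self-contained list (the names and values are our own):

```r
demo.list <- list(countries = c("Canada", "United States"),
                  values    = 1:5,
                  flag      = TRUE)

demo.list[2]              # single bracket: a list containing 'values'
demo.list[[2]]            # double bracket: the integer vector itself
demo.list[["countries"]]  # [[ ]] also accepts names
demo.list[[c(2, 4)]]      # recursive: element 4 of element 2, i.e. 4L
```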
Comprehension Question 2.2.4.1: Suppose we had a list named multiDF.list consisting of 3 data frames, as shown in the following code cell. How would you subset the 2nd and 3rd data frames into their own list? How would you access the “values” column from the 3rd data frame? Use the following code cell to help you out.
multiDF.list = list(mixed.df, rbind(mixed.df, mixed.df), rbind(mixed.df, mixed.df, mixed.df))
str(multiDF.list)
# Subset the 2nd and 3rd dataframes as their own list
...
# Output the "values" column of the 3rd dataframe
...
Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables and although it is tempting to think of them as character vectors this is a dangerous mistake. Adding or changing data in a data frame with pre-existing factors requires that you match factor levels correctly as well.
Factors make perfect sense if you are a statistician designing a
programming language (!) but to everyone else they exist solely to
torment us with confusing errors. At its core, a factor is really just
an integer vector with an additional attribute, levels (queried with
levels()), which defines the accepted values for that
variable.
Why not just use character vectors, you ask?
Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.
Since the inception of R, data.frame() has been used to create data
frames, but its default behaviour was to convert character vectors to
factors! This is a throwback to R's original purpose: performing
statistical analyses with methods like ANOVA, which examine the
relationships between variables (i.e., factors)!
As R has become more popular and its applications and packages have expanded, incoming users have been faced with remembering this obscure behaviour, leading to lost hours of debugging grief as they wonder why they can’t pull information from their data frames to do a simple analysis on C. elegans strain abundance via molecular inversion probes in datasets of multiplexed populations. #SuspiciouslySpecific
That meant that users usually had to create data frames including the toggle
data.frame(name=character(), value=numeric(), stringsAsFactors = FALSE)
Fret no more! As of R 4.0.0 the default behaviour
has switched and stringsAsFactors = FALSE is the
default! Now if we want our characters to be factors,
we must convert them explicitly, or turn this
behaviour on when creating each data frame!
# Generate a data frame and include factors
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
...)
)
Error in str(data.frame(country = character.vector, values = numeric.vector[2:4], : '...' used in an incorrect context
# Explicitly define factors for each variable.
str(data.frame(country = ...(character.vector),
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = FALSE)
)
Error in ...(character.vector): could not find function "..."
Define factors and their levels explicitly during or after data.frame creation

From above, you can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively, you can coerce character vectors to factors after generating them.
R’s default behaviour puts factor levels in alphabetical
order. This can cause problems if we aren’t aware of it. You can
check the order of your factor levels with the levels()
command. Furthermore, you can specify your level order during factor
creation.
Always check to make sure your factor levels are what you expect.
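A quick sketch of that default behaviour (the values are illustrative):

```r
# Default: levels are sorted alphabetically, which may not match their logical order
sizes <- factor(c("small", "large", "medium"))
levels(sizes)    # "large" "medium" "small"

# Specify the order explicitly during creation instead
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
levels(sizes)    # "small" "medium" "large"
```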
With factors, we can deal with our character levels directly, or their numeric equivalents.
# Generate a data frame and include factors
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = factor(c("North America", "North America", "Europe"),
... = c("North America", "Europe"))
)
)
Error in data.frame(country = character.vector, values = numeric.vector[2:4], : object 'character.vector' not found
# Coerce a factor
mixed.df <- data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"))
Error in data.frame(country = character.vector, values = numeric.vector[2:4], : object 'character.vector' not found
# Set our factor after declaring the data frame
mixed.df$continent <- factor(..., levels=c("North America", "Europe"))
Error in eval(expr, envir, enclos): '...' used in an incorrect context
str(mixed.df)
Error in str(mixed.df): object 'mixed.df' not found
Use levels() to list the levels and their order for
your factor
To rename levels of a factor, declare and reassign your factor.
Move a single level to the first position within your factor
levels with relevel().
Factor levels can be assigned an order of precedence during their
creation with the parameter ordered = TRUE.
Define labels for your factor during creation with the
parameter labels = c(). Note that level order is assigned
before labels are added to your data. You are essentially
labeling the integer assigned to your factor levels, so be
careful when using this parameter!
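A minimal sketch of both parameters, using made-up level names:

```r
# ordered = TRUE gives the levels an order of precedence,
# enabling comparisons between values
grades <- factor(c("low", "high", "mid"),
                 levels = c("low", "mid", "high"),
                 ordered = TRUE)
grades[1] < grades[2]   # TRUE: "low" comes before "high"

# labels relabel the integers underlying the levels, in level order,
# so the labels vector must line up with the levels vector
grades2 <- factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"),
                  labels = c("Low", "Medium", "High"))
levels(grades2)   # "Low" "Medium" "High"
```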
Advanced factor functions with forcats

If you’re looking for more advanced functions that you can use to manipulate, sort, or update factors, check out the forcats package. With it, you can refactor based on functions, frequency, or explicitly re-specify the order of one or more factor levels. We’ll see this package in action in more detail during later lectures.
Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!
Logical values (TRUE/FALSE) are coerced to numeric before operations are applied. Therefore, be careful to specify your numeric data for mathematical operations.
mixed.df
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Add to each element
mixed.df$values + 3
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Add columns to each other
mixed.df$values + mixed.df$values
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# multiply each element by a constant
mixed.df$values * 4
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# implicit coercion of logical to integer
mixed.df$commonwealth * 5
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Perform math on a factor
mixed.df$continent * 6
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Convert the factor to a numeric first
as.numeric(mixed.df$continent) * 7
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Can we perform math on non-numeric variables?
...
Error in eval(expr, envir, enclos): '...' used in an incorrect context
Use the apply() family of functions to perform actions across data structures

The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, not on your entire matrix or data frame. The apply() function will recognize basic functions and use them on vectorized data.
For example, we might have a count table where rows are genes,
columns are samples, and we want to know the sum of all the counts for a
gene. To do this, we can use the apply() function.
apply() takes an array or matrix (or something that can be
coerced to one, like a numeric data frame) and applies a function over
rows or columns. The apply() function takes the following
parameters:
- X: an array, matrix, or something that can be coerced to these objects.
- MARGIN: defines how to apply the function; 1 = rows, 2 = columns.
- FUN: the function to be applied, supplied as a function name without the () suffix.
- ...: additional parameters to pass to the function defined by FUN.

apply() returns a vector, array, or list depending on the nature of X.
Let’s practice by invoking the sum function.
# Make a sample data frame of numeric values only
numeric.df = data.frame(geneA = numeric.vector, geneB = numeric.vector*2, geneC = numeric.vector*3)
numeric.df
# Apply sum by rows
apply(numeric.df, ..., sum)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Apply sum by columns
apply(numeric.df, ..., sum)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
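If you'd like a filled-in sketch before attempting the cells above, here is one with a small made-up count table (the names are illustrative, not the course data):

```r
# Rows are genes, columns are samples
counts.df <- data.frame(sampleA = c(1, 2, 3),
                        sampleB = c(4, 5, 6),
                        row.names = c("gene1", "gene2", "gene3"))

apply(counts.df, 1, sum)   # MARGIN = 1: sum across each row (per gene)
apply(counts.df, 2, sum)   # MARGIN = 2: sum down each column (per sample)
```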
The apply() family

There are 3 additional members of the apply() family
that perform similar functions with varying outputs:

- lapply(data, FUN, ...) is usable on data frames, lists, and vectors. It returns a list as output.
- sapply(data, FUN, ...) works similarly to lapply() except it tries to simplify the output to the most elementary data structure possible, i.e. it returns the simplest form of the data that makes sense as a representation.
- mapply(FUN, data, ...) is short for “multivariate” apply; it applies a function to multiple list or vector arguments.
# Use lapply on the columns of numeric.df
...(numeric.df, sum)
Error in ...(numeric.df, sum): could not find function "..."
str(lapply(numeric.df, sum))
List of 3
$ geneA: int 54
$ geneB: num 108
$ geneC: num 162
# Use sapply on the columns of numeric.df
...(numeric.df, sum)
Error in ...(numeric.df, sum): could not find function "..."
str(sapply(numeric.df, sum))
Named num [1:3] 54 108 162
- attr(*, "names")= chr [1:3] "geneA" "geneB" "geneC"
# Using lapply and sapply and sum on an actual list
sum.list <- list(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
str(sum.list)
Error in str(sum.list): object 'sum.list' not found
# lapply on the list returns a list
lapply(sum.list, sum)
Error in lapply(sum.list, sum): object 'sum.list' not found
# sapply on the list returns a vector
sapply(sum.list, sum)
Error in lapply(X = X, FUN = FUN, ...): object 'sum.list' not found
# Use lapply to select portions from a list
sum.list <- list(numeric.df, numeric.df)
# Extract the first row from each member of the list
lapply(sum.list, ...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Extract the 2nd column from each member of the list
lapply(sum.list, "[", , 2)
[[1]]
[1] -2 0 2 4 6 8 10 12 14 16 18 20
[[2]]
[1] -2 0 2 4 6 8 10 12 14 16 18 20
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", , 2)
[,1] [,2]
[1,] -2 -2
[2,] 0 0
[3,] 2 2
[4,] 4 4
[5,] 6 6
[6,] 8 8
[7,] 10 10
[8,] 12 12
[9,] 14 14
[10,] 16 16
[11,] 18 18
[12,] 20 20
Notice how in using sapply() to extract from a list of
data frames, a single matrix was returned - a single output in the
simplest form that maintains structure.
Now let’s give mapply() a try.
# Use mapply in an example on numeric.vector
mapply(sum, numeric.vector, numeric.vector)
[1] -2 0 2 4 6 8 10 12 14 16 18 20
# Use mapply in an example on numeric.df
mapply(sum, numeric.df, numeric.df)
geneA geneB geneC
108 216 324
# Use mapply on the rep function to see its output
mapply(rep, c(...), 4)
Error in mapply(rep, c(...), 4): '...' used in an incorrect context
NA and NaN values

Missing values in R are handled as NA (Not Available).
Impossible values (like the result of 0/0) are represented
by NaN (Not a Number). Both can be considered null
values. These values, especially NA, need special handling;
otherwise they may lead to errors in some functions.
For our purposes, we are not interested in keeping NA
data within our datasets so we will usually detect and remove them or
replace them within our data after it is imported.
Detecting and handling NA data

- is.na() returns a logical vector reporting which values from your query are NA.
- complete.cases() returns a logical vector reporting which rows contain no NA values.
- Many functions can ignore NA values with the na.rm = TRUE parameter: e.g. mean(), sum(), etc.
- The tidyr package can also be used to work with NA values.

# Add some NAs to our data frame
mixed.df <- data.frame(country = character.vector,
values = c(3, ..., 9),
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
measure = c("metric", NA, "metric")
)
Error in data.frame(country = character.vector, values = c(3, ..., 9), : object 'character.vector' not found
# Look at our updated data frame
mixed.df
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Which entries are NA?
is.na(mixed.df)
Error in eval(expr, envir, enclos): object 'mixed.df' not found
# Which rows are incomplete?
complete.cases(mixed.df)
Error in complete.cases(mixed.df): object 'mixed.df' not found
# Use some math functions
sum(mixed.df$values, ...)
Error in eval(expr, envir, enclos): object 'mixed.df' not found
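A self-contained sketch of these NA tools (the values are made up):

```r
vals <- c(3, NA, 9)

is.na(vals)               # FALSE  TRUE FALSE
sum(vals)                 # NA: one missing value poisons the result
sum(vals, na.rm = TRUE)   # 12: drop NAs before summing
mean(vals, na.rm = TRUE)  # 6
```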
Tidy data and the tidyverse

Let’s begin with some definitions:

- Variable: A part of an experiment that can be controlled, changed, or measured.
- Observation: The results of measuring the variables of interest in an experiment.
In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data. To read more about wide and long formats, visit here.
Why tidy data?
Data cleaning/wrangling (or dealing with ‘messy’ data) accounts for a huge chunk of a data scientist’s time. Ultimately, we want to get our data into a ‘tidy’ format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that standardized data structure can help this process along.
In Tidy data:
This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful to unraveling its structure and getting it into a usable format.
Observational units: Of the three rules, the idea of observational units might be the hardest to grasp. As an example, you may be tracking a puppy population across 4 variables: age, height, weight, fur colour. Each observation unit is a puppy. However, you might be tracking the same puppies across multiple measurements - so a time factor applies. In that case, the observation unit now becomes puppy-time. In that case, each puppy-time measurement belongs in a different table (at least by tidy data standards). This, however, is a simple example and things can get more complex when taking into consideration what defines an observational unit. Check out this blog post by Claus O. Wilke for a little more explanation.
Let’s begin this journey with data import.
The readr package

“All roads lead to Rome”… but not all roads are easy to travel.
Depending on format, data files can be opened in a number of ways.
The simplest methods we will use involve the readr package
as part of the tidyverse. These functions have already been
developed to simplify the import process for users. The functions we
will use most often are:
Read in a delimited file: read_delim(),
read_csv(), read_tsv(),
read_csv2() [European datasets]
Read in from a file, line by line:
read_lines()
Let’s read in our first dataset so that we can convert from wide to long format.
# Use read_csv to look at our PHU daily case data
covid_phu.df <- read_csv(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Check the structure and characteristics of covid_phu
str(..., give.attr = FALSE)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
head(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
tail(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
any(is.na(covid_phu.df))
Error in eval(expr, envir, enclos): object 'covid_phu.df' not found
From looking at our public health unit data, we can see that it begins tracking on 2020-01-23 and goes up until 2023-01-25. That’s over 3 years for anyone keeping track! In total there are observations for 1,099 days across 34 public health units. The final column appears to be a running tally of total cases reported on each date. Although the data no longer represents a semi-accurate accounting of the state of case reporting in Ontario, it will serve (for now) as a good refresher on how to do basic data wrangling.
From the outset, we can see there are some issues with the data set
that we’ll want to resolve and we’ll work through some
tidyverse functions in order to do that. First let’s
quickly review some of the potential problems with our dataset.
There are 34 public health units and a total count for each date.
It is preferable for data visualization to collapse all of those public
health units into a single variable so that we have a single value
new_cases for each Date observation. At the
same time we will not collapse Total into that same
variable.
The data is rife with NA values. Many instances are
likely due to no data being collected on those dates. For our purposes,
it may be simpler to replace them with a value of 0.
Our public health unit names are a little clunky. We should trim them down to simpler region names.
In the end, we want to convert our data to look something like this:
| date <date> | total_phu_new <dbl> | public_health_unit <fct> | new_cases <dbl> |
|---|---|---|---|
| 2020-01-23 | 0 | Algoma | 0 |
| 2020-01-23 | 0 | Brant_County | 0 |
| 2020-01-23 | 0 | Chatham_Kent | 0 |
| … | … | … | … |
Before we tackle these issues, let’s go ahead and review some of the tools at our disposal.
The tidyverse package and its contents make manipulating data easier

While the tidyverse is composed of multiple packages, we will be
focused on working with a subset of these: dplyr,
tidyr, and stringr.
Use the pipe %>% whenever you can!

To save on making extra variables in memory and to help make our code
more concise, we should use the %>% symbol. This is a
redirection or pipe symbol, similar to | in Unix
operating systems, and is used for redirecting output from one function
to the input of another. By thoughtfully combining this with other
commands, we can alter or query our datasets with ease.
We’ll also introduce the %<>% operator in this class. This
is a little more advanced, but it allows us to assign the final product
of our chain of commands back to the very first object.
Whenever we are redirecting, we are implicitly passing our output to
the first parameter of the next function. We may not always want to use
the entirety of the output or we may want to also reuse that redirected
output as part of another parameter. To do so we can use .
to explicitly denote the redirected output.
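A short sketch of both ideas (the numbers are arbitrary):

```r
library(magrittr)   # provides %>% and %<>%; dplyr also re-exports %>%

# Nested calls read inside-out...
round(sqrt(sum(c(1, 4, 9, 16))), 1)

# ...whereas the pipe reads left to right, feeding each result
# into the first argument of the next function
c(1, 4, 9, 16) %>% sum() %>% sqrt() %>% round(1)

# Use . when the piped value should land somewhere other than
# the first argument
4 %>% seq(from = 1, to = .)   # equivalent to seq(from = 1, to = 4)
```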
dplyr has functions for accessing and altering your data

We will use the “verbs” of the dplyr package often to
massage the look of our data by changing column names or subsetting it.
The most common verbs you will see in this course are:
| Function(s) | Description |
|---|---|
arrange() |
Arranging rows by column values |
count(), tally() |
Counting observations by group |
distinct() |
Subsetting rows by distinct or unique values |
filter() |
Subsetting rows by column values |
mutate(), transmute() |
Create, modify, or delete columns |
select() |
Subset columns using their names and types |
summarize() or
summarise() |
Summarize by groups to fewer rows |
group_by() vs. ungroup() |
group by one or more variables |
rowwise() |
group data as single rows for calculations across each |
rename(), and relocate() |
Rename or move columns |
tidyr has additional functions for reshaping our data

The tidyr package will be most useful when we are trying
to reshape our data from wide to long format or vice
versa. This is much more useful when we want to drastically
alter portions or all of our data.
| Function(s) | Description |
|---|---|
pivot_longer() |
Pivot data from wide to long |
pivot_wider() |
Pivot data from long to wide |
extract() |
Extract a character column into multiple groups |
separate() |
Separate a character column into multiple groups |
unite() |
Unite multiple columns into one by pasting strings |
drop_na() |
Drop rows containing missing values |
replace_na() |
Replace NAs with specific values |
stringr provides functionality for searching data based on regular expressions

The stringr package will come in most useful when we are
trying to fix string issues with our data. Many times our headers or data
will contain spaces or poor formatting. We will often prefer to
have our headers in lower case, with any spaces replaced by an
_. We’ll also use verbs from this package to make any
variables or data more concise.
| Category | Function(s) | Description |
|---|---|---|
| String analysis | str_count() |
Count the number of matches in a string |
| String retrieval | str_detect() |
Detect the presence (or absence) of a pattern in string |
str_extract() and
str_extract_all() |
Extract matching patterns from a string | |
str_match() and
str_match_all() |
Extract matched groups from a string | |
str_subset() and
str_which() |
Keep or find strings matching a pattern | |
| String alteration | str_remove() and
str_remove_all() |
Remove matched patterns from a string |
str_split(),
str_split_fixed(), and str_split_n() |
Split a string into pieces | |
str_c() |
Concatenate multiple strings into a single string with optional separator | |
str_flatten() |
Flatten a string | |
str_sub() |
Extract and replace substrings from a character vector | |
str_to_upper() and
str_to_lower() |
Convert case of a string |
Time to tackle our dataset!
pivot_longer()

As you may recall, our PHU data is formatted such that each column represents new cases per day for a single PHU. It’s a great way to format for data entry and certainly reduces redundancy. However, for us to work with this data, we want to collapse all of those PHUs into a single column.
Today we will use the pivot_longer() function to convert
our wide-format data over to long format. For our purposes, we will rely
on four parameters:

1. data: the data frame (and columns) that we wish to transform.
2. cols: the columns that we wish to gather/collapse into a long format.
3. names_to: the variable name of the new column to hold the collapsed information from our current columns.
4. values_to: the variable name of the values for each observation that we are collapsing down.
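As a warm-up before the real dataset, here is a tiny stand-in wide table (the values and column names are invented) pivoted with those four parameters:

```r
library(tidyr)

wide.df <- data.frame(Date   = c("2020-01-23", "2020-01-24"),
                      Algoma = c(0, 1),
                      Brant  = c(2, 3))

pivot_longer(wide.df,
             cols = c(Algoma, Brant),
             names_to = "public_health_unit",
             values_to = "new_cases")
```

Each Date now repeats once per former column, giving one row per date-PHU observation.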
We’ll be using a series of %>% so for now we won’t
save our work to a new object.
# A reminder of what our data looks like
covid_phu.df %>% head()
Error in head(.): object 'covid_phu.df' not found
# Start with our wide-format phu data
covid_phu.df %>%
# Pivot the data into a long-format set
pivot_longer(cols = ..., names_to = ..., values_to = ...) %>%
# Just take a quick look at the output.
str()
Error in str(.): '...' used in an incorrect context
Replace NA values in our data with replace_na()

Our conversion to long format creates 37,366 observations relating a
Date to a new_cases value in a specific
Public_Health_Unit (or total). From the looks of our data,
however, we do have NA values under our
new_cases variable.
We have two options:

1. Remove the NA observations from our data set. There won’t be any loss of information since we could rebuild the original data if we really needed to.
2. Replace the NA observations with a value that makes sense for our analysis.

Let’s replace the missing observations with a new value, 0, using
replace_na(). This function will need two parameters:

- data: the data frame or vector that it will scan for NA values.
- replace: the value that we will use to replace NA.

We’re going to update our pipe of commands and save the final output
into a new variable covid_phu_long.df.
# Pivot the data into a long-format set and remove NAs from the value table
covid_phu_long.df <-
covid_phu.df %>%
pivot_longer(cols = c(2:35), names_to = "public_health_unit", values_to = "new_cases") %>%
### Change the values of "new_cases" using the mutate function
mutate(new_cases = ...)
Error in covid_phu.df %>% pivot_longer(cols = c(2:35), names_to = "public_health_unit", : '...' used in an incorrect context
# Check that we have covered all of the NA values in our data frame by looking for complete cases
nrow(covid_phu_long.df[complete.cases(covid_phu_long.df),])
Error in nrow(covid_phu_long.df[complete.cases(covid_phu_long.df), ]): object 'covid_phu_long.df' not found
# Or just check for NA values
any(is.na(covid_phu_long.df))
Error in eval(expr, envir, enclos): object 'covid_phu_long.df' not found
# Take a look at the Public Health Unit names
print(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
str_replace_all()

Looking at our PHU names, we can see that there is a lot of redundancy in our names. We see they may sometimes end or begin with some form of:

- _District
- _Region
- _City
- County
- City_of
We have a couple of choices but we can either use
str_replace_all() or a specific version of that,
str_remove_all(), which simply replaces a pattern with an
empty character.
For str_replace_all() we will supply:

1. string: a single string or vector of strings.
2. pattern: the pattern we wish to search for, as a string or regular expression.
3. replacement: the replacement string we wish to use.
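A sketch on a few invented PHU-style names, assuming the suffix patterns listed above:

```r
library(stringr)

phus <- c("Algoma_District", "Brant_County", "City_of_Toronto")

# Remove the descriptive fragments; the pattern is a regular expression
trimmed <- str_remove_all(phus, "_District|_County|City_of_")
trimmed   # "Algoma" "Brant" "Toronto"

# Replace the remaining underscores with spaces
str_replace_all(trimmed, pattern = "_", replacement = " ")
```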
For the purposes of our visualization, and now that these are no
longer column names, we will replace all remaining underscore
(_) characters with a space. To wrap up, we’ll convert
our updated variable to a factor and overwrite our original
covid_phu_long.df.
We will accomplish this all through multiple calls to mutate.
# Clean up the Public Health Unit names
covid_phu_long.df %<>%
# Replaces our public_health_unit values with ones where we remove excess verbage
mutate(public_health_unit = str_replace_all(string = ...,
pattern = ...,
replacement = "")) %>%
# From the updated version of public_health_unit, replace all _ with a " "
mutate(public_health_unit = str_replace_all(string = ...,
pattern = "_",
replacement = " ")) %>%
# Now make sure that it's a factor for later
mutate(public_health_unit = as.factor(public_health_unit))
Error in mutate(., public_health_unit = str_replace_all(string = ..., : object 'covid_phu_long.df' not found
# Take a look at the new set of phu names
print(levels(covid_phu_long.df$public_health_unit))
Error in levels(covid_phu_long.df$public_health_unit): object 'covid_phu_long.df' not found
# Take a quick look at our final dataset
head(covid_phu_long.df)
Error in head(covid_phu_long.df): object 'covid_phu_long.df' not found
# Make a quick copy here too
covid_phu_long_copy.df = covid_phu_long.df
Error in eval(expr, envir, enclos): object 'covid_phu_long.df' not found
rename() variables for clarity

Now that we have the basic structure for our data, we want to clean
it up just a little bit by renaming our Total column to
clarify that it represents total new cases across all PHUs for that
date. Why did we keep this column separate? Now we can use this
information to generate percentage totals for each PHU if we choose to.
We’ll also change our Date column to lower case at the same
time.
We’ll use rename() from dplyr to accomplish
the task of renaming our column. There are a number of ways you could
accomplish this without using dplyr but the simplicity of
it is nice.
# Rename our Total column to clarify its meaning
covid_phu_long.df %>%
rename(... = Total,
... = Date) %>%
head()
Error in rename(., ... = Total, ... = Date): object 'covid_phu_long.df' not found
relocate()

The last cleanup we can accomplish with our data is to move
total_phu_new to the last column of our data frame. This is
personal preference but also makes more sense when simply looking at
the data. The relocate() verb from dplyr
accomplishes this with ease since we are not dropping or removing
columns. It uses some extra syntax to help accomplish its functions:

- .data: the data frame or tibble we want to alter.
- ...: the columns we wish to move.
- .before or .after: determines the destination of the columns. Supplying neither will move columns to the left-hand side.

In fact, relocate() can be used to rename a column as
well, but the column will also be moved by default, so consider the
ramifications of such an action!
Note: We could accomplish a similar result using the
select command as well. It’s really up to what you’re
comfortable with but it is much simpler to use relocate()
when you are working with a large number of columns and you want to move
one to a specific location.
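A quick sketch of the destination parameters on a throwaway data frame:

```r
library(dplyr)

df <- data.frame(a = 1, b = 2, c = 3)

relocate(df, c)               # no destination: c moves to the far left
relocate(df, a, .after = c)   # a moves to sit after column c
relocate(df, c, .before = a)  # c moves to sit before column a
```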
# Rename our Total column to clarify its meaning
covid_phu_long.df %<>%
rename(total_phu_new = Total,
date = Date) %>%
# relocate our total column to the right side
relocate(total_phu_new, ... = new_cases)
Error in rename(., total_phu_new = Total, date = Date): object 'covid_phu_long.df' not found
head(covid_phu_long.df)
Error in head(covid_phu_long.df): object 'covid_phu_long.df' not found
Comprehension Question 3.2.5: In the above example we used the relocate() function to move the “total_phu_new” column to the end of our data frame. What other methods could we use to accomplish the same feat? Use the below code cell to help yourself out.
# Relocate our target column using the select() command
covid_phu_long_copy.df %>%
rename(total_phu_new = Total,
date = Date) %>%
# relocate our total column to the right side
... %>%
head()
Error in ...(.): could not find function "..."
At this point we have completed the data wrangling we want to
accomplish on this dataset. We’ve converted it to a long-format and
renamed the PHU entries while removing any NA values that
may cause issues. There are a number of ways we could save this data now
either as a text file or in its current form as a data frame in a
.RData format.
- Write out a delimited file: write_delim(), write_csv(), write_tsv(), write_excel_csv()
- Write out line by line: write_lines()
- Save R objects: save()
- Load saved R objects: load()

Let’s try some of those methods now.
# Check the files names we currently have
print(dir("./data/"))
[1] "hospitalizations_pt.csv"
[2] "Lecture01.RData"
[3] "Ontario_covidtesting.csv"
[4] "Ontario_daily_change_in_cases_by_phu.csv"
[5] "Ontario_daily_change_in_cases_by_phu_long.RData"
[6] "Ontario_daily_change_in_cases_by_phu_long.tsv"
[7] "Ontario_phu_data.all.facet.png"
[8] "region_hospital_icu_covid_data.csv"
# Write covid_phu_long.df to a tab-delimited file
...(covid_phu_long.df, file = "./data/Ontario_daily_change_in_cases_by_phu_long.tsv")
Error in ...(covid_phu_long.df, file = "./data/Ontario_daily_change_in_cases_by_phu_long.tsv"): could not find function "..."
# Check our file names after writing
print(dir("./data/"))
[1] "hospitalizations_pt.csv"
[2] "Lecture01.RData"
[3] "Ontario_covidtesting.csv"
[4] "Ontario_daily_change_in_cases_by_phu.csv"
[5] "Ontario_daily_change_in_cases_by_phu_long.RData"
[6] "Ontario_daily_change_in_cases_by_phu_long.tsv"
[7] "Ontario_phu_data.all.facet.png"
[8] "region_hospital_icu_covid_data.csv"
# Save our data frame as an object
save(covid_phu_long.df, file="./data/Ontario_daily_change_in_cases_by_phu_long.RData")
Error in save(covid_phu_long.df, file = "./data/Ontario_daily_change_in_cases_by_phu_long.RData"): object 'covid_phu_long.df' not found
# Check our file names after saving
print(dir("./data/"))
[1] "hospitalizations_pt.csv"
[2] "Lecture01.RData"
[3] "Ontario_covidtesting.csv"
[4] "Ontario_daily_change_in_cases_by_phu.csv"
[5] "Ontario_daily_change_in_cases_by_phu_long.RData"
[6] "Ontario_daily_change_in_cases_by_phu_long.tsv"
[7] "Ontario_phu_data.all.facet.png"
[8] "region_hospital_icu_covid_data.csv"
The readxl and writexl packages for working with Excel spreadsheets

Not all of your data may come in a comma- or tab-delimited format. In
the case of Excel spreadsheets there are some packages available that
can also facilitate the parsing of these more complex files. The
readxl package is part of the tidyverse but the
writexl package is not. There are other means of writing to
an Excel file format, but they depend on other programs (like Java
or Excel) or their drivers.

From the readxl package:

- excel_sheets()
- read_excel()

From the writexl package (not a part of the tidyverse, but independent of Java and Excel):

- write_xlsx()

ggplot2

We now have some data in a tidy format that we’d like to visualize.
We can begin with some initial analyses of the data using the
ggplot2 package. It has all of the components we need to
help us decide which data we want to focus on or keep. There are a
number of ways to visualize our data and here we will refresh our
ggplot skills.
Basic ggplot notes:

- ggplot objects hold a complex number of attributes but always need an initial source of data.
- ggplot objects can be modified with the + symbol by adding in layers.
- ggplot objects can be plotted, saved, and passed around.

As we start to produce plot figures, they’ll vary in size depending on your needs. In an R Markdown code cell, you can set your figure size using the code cell attributes, much like the parameters of a function. You can set the figure dimensions using fig.width and fig.height. As we proceed, you’ll see us setting these attributes within our code cells.
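A minimal sketch of those notes with throwaway data:

```r
library(ggplot2)

toy.df <- data.frame(day = 1:5, cases = c(2, 4, 3, 6, 5))

p <- ggplot(toy.df)                  # initialized with a data source
p <- p + aes(x = day, y = cases)     # layers added with +
p <- p + geom_line() + geom_point()
p                                    # the object can be printed (plotted) or saved
```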
# Initialize a plot with our data
phu.plot <- ggplot(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a quick look at the structure of the data
str(phu.plot)
Error in str(phu.plot): object 'phu.plot' not found
We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We’ll begin with a simple line graph of all the public health units across all dates within the set.
In order to update or add layers to a ggplot object, we can use the + symbol for each command. For instance, to define the source of the x-axis and y-axis data, we use the aes() command to update the aesthetics layer. Remember how we defined the public_health_unit variable as a factor? We’ll take advantage of that here and tell ggplot to give each PHU its own colour.
After defining our aesthetics, we still need to tell
ggplot how to actually graph the data. The
ggplot package comes with an abundance of visualizations
accessed through the geom_*() commands. Some examples
include
- geom_point() for scatterplots
- geom_line() for line graphs
- geom_boxplot() for boxplots
- geom_violin() for violin plots
- geom_bar() for bar graphs
- geom_histogram() for histograms

# Update the aesthetics with axis and colour information, then add a line graph!
phu.plot +
  # 2. Aesthetics
  aes(x = date, y = new_cases, colour = public_health_unit) +
  theme(text = element_text(size = 20)) + # set text size
  guides(colour = guide_legend(title="Public Health Unit")) + # Legend title
  xlab("Date") + # Set the x-axis label
  ylab("New cases") + # Set the y-axis label
  # 4. Geoms
  geom_line()
facet_wrap() command to break PHUs into separate graphs

There’s a lot of data on that graph and some of it is quite drowned out because of the scale of PHUs with many more cases. To break out each PHU individually, we can add the facet_wrap() command. We’ll also update some of the parameters:

- scales: we will update this so each y-axis scale is determined by PHU-specific data
- ncol: use this to set the number of columns displayed in our grid

At the same time, we’ll also get rid of the legend since each individual graph will be labelled by its PHU.
# This is going to be a big graph so adjust our plot window sizes for us
options(repr.plot.width=20, repr.plot.height=30)
# Add a facet_wrap and get rid of the legend
phu_facet.plot <- phu.plot +
# 2. Aesthetics
aes(x = date, y = new_cases, colour = public_health_unit) +
theme(text = element_text(size = 20)) + # set text size
# Give titles to your axes
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across Ontario Public Health Units") +
# Remove the legend
theme(legend.position = "none") +
# 4. Geoms
geom_line() +
  # Facet our data by PHU
  facet_wrap(~ public_health_unit, scales = "free_y", ncol = 4) # ncol here is a suggested value; adjust as needed
# Display our plot
phu_facet.plot
ggsave() command to save your plots to a file

There are a number of ways you can use the ggsave() command to specify how you want to save your files.
# What is our working directory?
getwd()
[1] "C:/Users/mokca/Dropbox/!CAGEF/Course_Materials/Advanced_Graphics_in_R/2024.03_Adv_Graphics_R/Lecture_01_R_Introduction"
# Save the plot we've generated to the root directory of the lecture files.
ggsave(phu_facet.plot,
       filename = "data/Ontario_phu_data.all.facet.png",
       scale=2,
       device = "png",
       units = c("cm"), width = 20, height = 30)
# Take a look at the directory
dir("data/")
[1] "hospitalizations_pt.csv"
[2] "Lecture01.RData"
[3] "Ontario_covidtesting.csv"
[4] "Ontario_daily_change_in_cases_by_phu.csv"
[5] "Ontario_daily_change_in_cases_by_phu_long.RData"
[6] "Ontario_daily_change_in_cases_by_phu_long.tsv"
[7] "Ontario_phu_data.all.facet.png"
[8] "region_hospital_icu_covid_data.csv"
Although we do have a running total for each date, what if we want to look at the total cases across subsets of the PHUs? Using a barplot we can stack cases by date and get a sense of daily case totals for whichever sets of PHUs we desire.
This time we will use geom_bar() to display our data and
tell it to use the values from our new_cases variable to
generate the totals. We do this by setting the
stat = "identity" parameter.
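As a minimal sketch of what this parameter does (toy data, made up for illustration): by default geom_bar() counts rows, while stat = "identity" plots the supplied y values directly, which is also what the geom_col() shorthand does.

```r
library(ggplot2)

# Made-up data for illustration
df <- data.frame(grp = c("A", "B", "B"), val = c(1, 2, 3))

# Bar heights come from the val column rather than from row counts
ggplot(df, aes(x = grp, y = val)) +
  geom_bar(stat = "identity")

# geom_col() is shorthand for the same thing
ggplot(df, aes(x = grp, y = val)) +
  geom_col()
```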
At the same time, let’s update our colours to use a colour-blind friendly palette scheme.
phu.plot +
  # 2. Aesthetics
  aes(x = date, y = new_cases, fill = public_health_unit) + # set our fill colour instead of line colour
  theme(text = element_text(size = 20)) + # set text size
  guides(fill = guide_legend(title="Public Health Unit")) +
  # Give titles to your axes
  xlab("Date") + # Set the x-axis label
  ylab("New cases") + # Set the y-axis label
  ggtitle("New cases per day across all Ontario Public Health Units") +
  # Set up our barplot here
  geom_bar(stat = "identity") +
  scale_fill_viridis_d() # the "d" stands for discrete colour scale
From above we get a sense of overall totals for some PHU
distributions but it’s still too much to look at. Let’s transform our
x-axis values so we can bin by months instead. To accomplish this we’ll
use the as.yearmon() function found in the zoo
package we loaded at the beginning of the lecture.
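As a small standalone illustration of what as.yearmon() does to Date values:

```r
library(zoo)

d <- as.Date(c("2020-03-01", "2020-03-15", "2020-04-02"))
as.yearmon(d)
# Dates within the same month collapse to the same year-month value,
# e.g. "Mar 2020" "Mar 2020" "Apr 2020"
```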
phu.plot +
  # 2. Aesthetics
  aes(x = as.yearmon(date), # transform the x-axis values to year-month format
      y = new_cases,
      fill = public_health_unit) + # set our fill colour instead of line colour
  theme(text = element_text(size = 20)) + # set text size
  guides(fill = guide_legend(title="Public Health Unit")) +
  # Give titles to your axes
  xlab("Date") + # Set the x-axis label
  ylab("New cases") + # Set the y-axis label
  ggtitle("New cases per month across all Ontario Public Health Units") +
  # Set up our barplot here
  geom_bar(stat = "identity") +
  scale_fill_viridis_d() # the "d" stands for discrete colour scale
Now that we have taken an initial look at our data, we can see that even after converting our axis to a month-year format, it appears that some of the data isn’t that relevant for us. Some of the PHUs are not generating many new cases per day so we can now consider slicing our data up to look at specific regions.
Let’s look at the top 10 regions by total caseload across the dataset.
# What are the top 10 regions by total caseload?
covid_phu_long.df %>%
  # group the data by public health unit
  group_by(public_health_unit) %>%
  # Summarize it by the total number of new cases in each PHU
  summarise(total_cases = sum(new_cases)) %>%
  # Sort all of the data in descending order by total cases
  arrange(desc(total_cases)) %>%
  # take the top 10 PHUs
  .[1:10, ]
# Generate a list of all PHUs and sort by total caseload
phu_by_total_cases_desc <- covid_phu_long.df %>%
  # Group by public health unit
  group_by(public_health_unit) %>%
  # Based on public health unit, sum the total cases
  summarise(total_cases = sum(new_cases)) %>%
  # Sort by descending order
  arrange(desc(total_cases)) %>%
  # Grab the PHU names and convert them into a character vector
  select(public_health_unit) %>%
  unlist() %>%
  as.character() # Coercion to a vector removes the names. unname() works as well.

# Take a look at the public health units
print(phu_by_total_cases_desc)
filter() command to make a subset of our data

Now that we have a list of PHUs ordered by descending total cases, we can use that to filter our covid_phu_long.df dataframe and graph only the more heavily infected PHUs. We can then pipe the filtered data over to make a ggplot() object. At the same time we’ll do a few more things:
# Make a bar graph
covid_phu_long.df %>%
  # Filter our data based on the PHUs we want to see
  filter(public_health_unit %in% phu_by_total_cases_desc[1:3]) %>%
# Redirect our new data frame to ggplot
ggplot(.) +
# 2. Aesthetics
aes(x = as.yearmon(date),
y= new_cases,
fill = fct_reorder(public_health_unit, new_cases)) + # reordering the levels of the data supplied
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
# Give titles to your axes
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per month across top 3 Ontario Public Health Units") +
# Set up our barplot here
geom_bar(stat = "identity") +
scale_fill_viridis_d() # the "d" stands for discrete colour scale
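In miniature, the filter() + %in% pattern used above looks like this (toy data, made up for illustration):

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(phu = c("Toronto", "Ottawa", "Peel", "York"),
                  cases = c(500, 200, 400, 300))
keep <- c("Toronto", "Peel")

# Keep only the rows whose phu value appears in the keep vector
filter(toy, phu %in% keep)
```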
We can see from our first graph of daily case loads that there can be quite a bit of variability from day to day. Rather than looking at the daily tally of new cases, perhaps we can take into account the overall number of new cases appearing in a 14-day sliding window. Given that symptoms can take between 5 and 14 days from the time of infection to manifest, a portion of daily positive cases can be the result of infections going back as far as 14 days. Taking a 14-day mean will also smooth out our data, as we see below:
To accomplish the above visualization, we’ll need to perform some transformations on our dataset.
We’ll want to track observations by:

- public health unit
- cases in the window
- window start date
- window end date
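We build the windows by hand with a loop so that we can also track the window boundaries per PHU. As an aside, for a single vector the zoo package can compute the same kind of rolling mean directly (made-up numbers for illustration):

```r
library(zoo)

# Made-up daily counts for illustration
daily <- c(10, 12, 8, 15, 20, 18, 25)

# Mean of each consecutive 3-day window; the result is 2 elements shorter
rollmean(daily, k = 3)
```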
# Shut down some output information from the summarise function
options(dplyr.summarise.inform = FALSE)
# 1. group our data by public health unit
covid_phu_long.df <- covid_phu_long.df %>% group_by(public_health_unit)
# 2. get a complete list of case dates
case.dates <- unique(covid_phu_long.df$date)
# 3. set up a table to hold our summarised results
phu_window_data.df = data.frame(public_health_unit = character(0),
window_mean = numeric(0),
start_date = numeric(0), end_date = numeric(0))
case_window = 14-1 # a 14-day window spans the start date plus 13 more days
# Iterate through the dates in a 14-day sliding window
for (i in 1:(length(case.dates)-case_window)) {
  curr.set <- covid_phu_long.df %>%
    # Filter for a set of data that spans 14 days
    filter(date %in% case.dates[i:(i + case_window)]) %>%
    # Summarize that data based on public health unit
    summarize(window_mean = mean(new_cases))
  # Track the start and end dates of the window
  curr.set$start_date = case.dates[i]
  curr.set$end_date = case.dates[i + case_window]
  # Add this table to the collected data
  phu_window_data.df <- rbind(phu_window_data.df, curr.set)
}
# Check on the final structure of the data
str(phu_window_data.df)
Now that we’ve generated our windowed data, let’s plot the top 5 PHUs by caseload. Let’s also annotate some dates from our pandemic history:
- theme(): we can use this layer to access any number of elements regarding the overall look/feel of our visualization
- scale_*: the scale layers allow us to alter the parameters of how our axis values are calculated or even the colours of various components!
- geom_text(): used to directly add text based on a mix of variables pulled from your data or specific start/end points
- annotate(): a layer to overlay components onto your visualization, like shapes or arrows

In the coming weeks we’ll be digging into the meaning of these more, but for this week it’s a bit of a trial by fire/memory.
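A compact, made-up example of these layers working together (all values are arbitrary):

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(x = 1:10, y = (1:10)^2)

ggplot(toy, aes(x = x, y = y)) +
  geom_line() +
  # Shade a region of interest across the full y range
  annotate("rect", xmin = 3, xmax = 5, ymin = -Inf, ymax = Inf,
           fill = "red", alpha = 0.2) +
  # Label the shaded region
  annotate("text", x = 4, y = 80, label = "window of interest") +
  # Tweak the overall look
  theme(panel.grid.minor = element_blank())
```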
# Build our plot and save to an object
phu_window.plot <- phu_window_data.df %>%
# Filter for the top 5 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:5]) %>%
# redirect the filtered result to ggplot
ggplot(.) +
# 2. Aesthetics
  aes(x = start_date, y = window_mean, colour = fct_reorder(public_health_unit, window_mean, .desc=TRUE)) +
theme_bw() + # Simplify the theme
xlab("Date") +
ylab("Mean cases in 14-day window") +
ggtitle("Mean cases in a 14-day window across top 5 Ontario Public Health Units") +
theme(text = element_text(size = 20)) + # set text size
guides(colour = guide_legend(title="Public Health Unit")) + # set our legend name
theme(panel.grid.major.y = element_line(color="grey95")) + # darken our major y grid
theme(panel.grid.minor.y = element_blank()) + # remove our minor y grid
theme(panel.grid.minor.x = element_blank()) + # remove our minor x grid
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + # rotate our x-axis text
# 3. Scaling
# Start looking at data from March 2020 onwards
  scale_x_date(limits = c(as.Date("2020-03-01"), as.Date(max(phu_window_data.df$start_date))),
               date_breaks = "1 month", date_labels = "%b-%Y") +
scale_color_viridis_d() +
# 4. Geoms
geom_line(linewidth=1.5) + # Note that "size=1.5" works here too but is deprecated
# Winter 2020 lockdown
geom_text(aes(x=as.Date("2020-12-26") + 7, label = "Province-wide lockdown", y=2200),
angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-12-26"), xmax=as.Date("2020-12-26") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Spring 2021 Lockdown
geom_text(aes(x=as.Date("2021-04-03") + 7, label = "Province-wide lockdown", y=2200),
angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2021-04-03"), xmax=as.Date("2021-04-03") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Omicron arrives
geom_text(aes(x=as.Date("2021-11-30"), label = "First Omicron\ncases reported\nin Ontario", y=1000),
hjust=1, vjust = 0, size=10, colour="black") +
annotate("segment", x=as.Date("2021-11-28"), xend = as.Date("2021-11-28"),
y=800, yend=100, colour="red", linewidth = 2, arrow = arrow()) +
# Ontario ends proper PCR testing
geom_text(aes(x=as.Date("2022-03-01"), label = "Ontario reduces public\nPCR COVID-19 testing", y=2500),
hjust=0, size=10, colour="black") +
annotate("segment", x=as.Date("2022-03-01"), xend = as.Date("2021-12-31"),
           y=2500, yend=2500, colour="red", linewidth = 2, arrow = arrow())

# plot our object to standard output
phu_window.plot
# If you wanted to save your plot:
# ggsave(phu_window.plot, file="images/top5_PHU_cases_14d-window.png", scale=1, device = "png", units = c("in"), width=20, height=10)
One of the last things we want to cover before wrapping up is the importance of grouping your data and summarizing it. This paradigm is often a simple and powerful way to generate summary information about your various data groups/experiments.
Looking back at our last data series, it was noted that after December 2021, the metrics concerning new case counts became unreliable due to a reduction in COVID-19 testing of the public. Instead, due to the influx of cases, it became more accurate to monitor metrics like hospitalizations and the COVID wastewater signal.
To this end, let’s look at the COVID hospitalization data by importing region_hospital_icu_covid_data.csv from our data folder. This gives us an idea of the stress being applied to the healthcare system and can also give us an idea of how severe the pandemic may be from wave to wave.
# Import the hospitalization data
covid_hospitalizations <- read_csv("data/region_hospital_icu_covid_data.csv")
# Take a quick look
str(covid_hospitalizations)
# How many regions are there?
unique(covid_hospitalizations$oh_region)
group_by() and summarize() paradigm to analyse data

So it looks like our hospitalization data begins around April 2020 and includes multiple metrics involving the status of ICU patients, but also the number of current hospitalizations. There is also a variable oh_region which should denote the health region from which the data is sampled.
The 5 regions reported can vary in size and resources but we can
combine these regions into single values to look at the overall number
of hospitalizations on a daily basis. To accomplish this feat we’ll turn
to the group_by() and summarize()
functions.
The key to using these is to identify the goals of your analysis. In
the current case, we want to combine all 5 health regions into a
singular one based on the date variable. From
there we wish to calculate the sum() of each group
on a variable like hospitalizations.
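In miniature, the pattern looks like this (toy numbers, made up for illustration):

```r
library(dplyr)

# Made-up data: two regions reporting on two dates
toy <- data.frame(date = c("2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"),
                  oh_region = c("EAST", "WEST", "EAST", "WEST"),
                  hospitalizations = c(5, 3, 6, 4))

# One row per date, with the regional counts summed together
toy %>%
  group_by(date) %>%
  summarize(all_curr_hospitalizations = sum(hospitalizations))
# 2020-04-01 -> 8, 2020-04-02 -> 10
```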
# Pass the hospitalization data
covid_hospitalizations %>%
  # Group the data by DATE
  group_by(date) %>%
  # Summarize each group as a sum of "hospitalizations"
  summarize(all_curr_hospitalizations = sum(hospitalizations)) %>%
  # Take a look at the data
  head()
Now that we know how to summarize the data, we can work on visualizing it. For the purposes of comparison, we can reuse our code from before and simply substitute in the new parameters for our visualizations (i.e. x and y values).
# Pass the hospitalization data
covid_hospitalizations %>%
# Group the data by DATE
group_by(date) %>%
# Summarize each group as a sum of "hospitalizations"
summarize(all_curr_hospitalizations = sum(hospitalizations)) %>%
# redirect the filtered result to ggplot
ggplot(.) +
  # 2. Aesthetics: update the x and y sources
  aes(x = date, y = all_curr_hospitalizations) +
theme_bw() + # Simplify the theme
xlab("Date") +
ylab("COVID hospitalizations") +
ggtitle("Current hospitalizations per day across Ontario") +
theme(text = element_text(size = 20)) + # set text size
theme(panel.grid.major.y = element_line(color="grey95")) + # darken our major y grid
theme(panel.grid.minor.y = element_blank()) + # remove our minor y grid
theme(panel.grid.minor.x = element_blank()) + # remove our minor x grid
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + # rotate our x-axis text
# 3. Scaling
# Start looking at data from March 2020 onwards
scale_x_date(limits = c(as.Date("2020-03-01"), as.Date(max(phu_window_data.df$start_date))),
date_breaks = "1 month", date_labels = "%b-%Y") +
scale_color_viridis_d() +
# 4. Geoms
geom_line(linewidth=1.5) +
# Winter 2020 lockdown
geom_text(aes(x=as.Date("2020-12-26") + 7, label = "Province-wide lockdown", y=1800),
angle=90, hjust = 0, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-12-26"), xmax=as.Date("2020-12-26") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Spring 2021 Lockdown
geom_text(aes(x=as.Date("2021-04-03") + 7, label = "Province-wide lockdown", y=1800),
angle=90, hjust = 0, size=10, colour="black") +
annotate("rect", xmin=as.Date("2021-04-03"), xmax=as.Date("2021-04-03") + 14,
ymin=-Inf, ymax=Inf, fill="red", alpha=0.2) +
# Omicron arrives
geom_text(aes(x=as.Date("2021-11-30"), label = "First Omicron\ncases reported\nin Ontario", y=1500),
hjust=1, vjust = 0, size=10, colour="black") +
annotate("segment", x=as.Date("2021-11-28"), xend = as.Date("2021-11-28"),
y=1200, yend=500, colour="red", linewidth = 2, arrow = arrow()) +
# Ontario ends proper PCR testing
geom_text(aes(x=as.Date("2022-03-01"), label = "Ontario reduces public\nPCR COVID-19 testing", y=2500),
hjust=0, size=10, colour="black") +
annotate("segment", x=as.Date("2022-03-01"), xend = as.Date("2021-12-31"),
y=2500, yend=2500, colour="red", linewidth = 2, arrow = arrow())
Well it looks like our hospitalization data tells a different story from the case report data! Something worth exploring in your assignment!
That’s our first class! If we’ve made it this far, we’ve reviewed:

1. Foundational concepts in R
2. Helpful functions for generating tidy data for analysis
3. Basics of visualizations using the ggplot2 package
We took a “messy” dataset from the Ontario government and created a tidy data set that we were able to visualize. We took that further by transforming the data into a 14-day sliding window of mean new cases per day in each public health unit. This clarified our picture of cases and visually confirmed that spread of SARS-CoV-2 did appear to be mitigated through lockdown orders.
Next week? Getting deeper into ggplot2!
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| | | - Does it make good use of available packages? |
| | | - Was data prepared properly? |
| Answers and Output | 50% | - Is output based on the correct dataset? |
| | | - Are groupings appropriate? |
| | | - Are titles/axes/legends correct? |
| | | - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: March 14th, 2024.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
lubridate package: https://r4ds.had.co.nz/dates-and-times.html

As of 2022-03-01, the latest stable R version is 4.2.1:
Windows:
- Go to http://cran.utstat.utoronto.ca/
- Click on ‘Download R for Windows’
- Click on ‘install R for the first time’
- Click on ‘Download R 4.2.1 for Windows’ (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the
instructions.
(Mac) OS X:
- Go to http://cran.utstat.utoronto.ca/
- Click on ‘Download R for (Mac) OS X’
- Click on R-4.2.1 .pkg (or a newer version)
- Open the .pkg file once it has downloaded and follow the
instructions.
Linux:
- Open a terminal (Ctrl + alt + t)
- sudo apt-get update
- sudo apt-get install r-base
- sudo apt-get install r-base-dev (so you can compile packages from source)
As of 2023-03-01, the latest RStudio version is 2022.12.0+353 (released 2022-12-15)
Windows (10/11):
- Go to https://posit.co/downloads/
- Click on ‘RSTUDIO-2022.12.0-353.EXE’ to download the installer (or a
newer version)
- Double-click on the .exe file once it has downloaded and follow the
instructions.
(Mac) OS X (11+):
- Go to https://posit.co/downloads/
- Click on ‘RSTUDIO-2022.12.0-353.DMG’ to download the installer (or a
newer version)
- Double-click on the .dmg file once it has downloaded and follow the
instructions.
Linux:
- Go to https://posit.co/downloads/
- Click on the installer that describes your Linux distribution,
e.g. ‘RSTUDIO-2022.12.0-353-AMD64.DEB’ (or a newer version)
- Double-click on the .deb file once it has downloaded and follow the
instructions.
- If double-clicking on your .deb file did not open the software
manager, open the terminal (Ctrl + alt + t) and type sudo dpkg
-i /path/to/installer/RSTUDIO-2022.12.0-353-AMD64.deb
_Note: You have 3 things that could change in this last command._
1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)
2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).
3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).
If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.
RStudio is an IDE (Integrated Development Environment) for R that
provides a more user-friendly experience than using R in a terminal
setting. It has 4 main areas or panes, which you can customize to some
extent under
Tools > Global Options > Pane Layout:
All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.
The Source is where you are keeping the code and annotation that you want to be saved as your script. The tab at the top left of the pane has your script name (i.e. ‘Untitled.R’), and you can switch between scripts by toggling the tabs. You can save, search or publish your source code using the buttons along the pane header. Code in the Source pane is not run or executed automatically.
To run your current line of code or a highlighted segment of code from the Source pane you can:

a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) or Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter (not recommended).
There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.
You can also type and execute your code (by hitting
ENTER) in the Console when the
> prompt is visible. If you enter code and you see a
+ instead of a prompt, R doesn’t think you are finished
entering code (i.e. you might be missing a bracket). If this isn’t
immediately fixable, you can hit Esc twice to get back to
your prompt. Using the up and down arrow keys, you can find previous
commands in the Console if you want to rerun code or fix an error
resulting from a typo.
On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information: upon start up about R (such as version number), during the installation of packages, when there are warnings, and when there are errors.
In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.
Objects are made by using the assignment operator
<-. On the left side of the arrow, you have the name of
your object. On the right side you have what you are assigning to that
object. In this sense, you can think of an object as a container. The
container holds the values given as well as information about ‘class’
and ‘methods’ (which we will come back to).
Type x <- c(2,4) in the Console followed by
Enter. 1D objects’ data types can be seen immediately as
well as their first few values. Now type
y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c"))
in the Console followed by Enter. You can immediately see
the dimension of 2D objects, and you can check the structure of data
frames and lists (more later) by clicking on the object’s arrow.
Clicking on the object name will open the object to view in a new tab.
Custom functions created in session or sourced will also appear in this
pane.
The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (e.g. base, grDevices).
In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.
The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.
The Files tab allows you to search through directories; you can go to
or set your working directory by making the appropriate selection under
the More (blue gear) drop-down menu. The ...
to the top left of the pane allows you to search for a folder in a more
traditional manner.
The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.
The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to the package means that the package is loaded. You can load a package by adding a check mark next to a package, however it is good practice to instead load the package in your script to aid in reproducibility.
The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case googling the function is a good idea.
The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.
I suggest you take a look at Tools -> Global Options
to customize your experience.
For example, under Code -> Editing I have selected
Soft-wrap R source files followed by Apply so
that my text will wrap by itself when I am typing and not create a long
line of text.
You may also want to change the Appearance of your code.
I like the RStudio theme: Modern and
Editor font: Ubuntu Mono, but pick whatever you like!
Again, you need to hit Apply to make changes.
That whirlwind tour isn’t everything the IDE can do, but it is enough to get started.
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.